AI alignment

In the field of artificial intelligence (AI), AI alignment research aims to steer AI systems toward a person's or group's intended goals, preferences, and ethical principles. An AI system is considered aligned if it advances its intended objectives. A misaligned AI system may pursue some objectives, but not the intended ones.^[1]

It is often challenging for AI designers to align an AI system due to the difficulty of specifying the full range of desired and undesired behaviors. To aid them, they often use simpler proxy goals, such as gaining human approval. But that approach can create loopholes, overlook necessary constraints, or reward the AI system for merely appearing aligned.^[1]^[2]

Misaligned AI systems can malfunction and cause harm. AI systems may find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful, ways (reward hacking).^[1]^[3]^[4] They may also develop unwanted instrumental strategies, such as seeking power or survival because such strategies help them achieve their final given goals.^[1]^[5]^[6] Furthermore, they may develop undesirable emergent goals that may be hard to detect before the system is deployed and encounters new situations and data distributions.^[7]^[8]

Today, these problems affect existing commercial systems such as language models,^[9]^[10]^[11] robots,^[12] autonomous vehicles,^[13] and social media recommendation engines.^[9]^[6]^[14] Some AI researchers argue that more capable future systems will be more severely affected, since these problems partially result from the systems being highly capable.^[15]^[3]^[2]

Many of the most-cited AI scientists,^[16]^[17]^[18] including Geoffrey Hinton, Yoshua Bengio, and Stuart Russell, argue that AI is approaching human-like (AGI) and superhuman cognitive capabilities (ASI) and could endanger human civilization if misaligned.^[19]^[6]

AI alignment is a subfield of AI safety, the study of how to build safe AI systems.^[20] Other subfields of AI safety include robustness, monitoring, and capability control.^[21] Research challenges in alignment include instilling complex values in AI, developing honest AI, scalable oversight, auditing and interpreting AI models, and preventing emergent AI behaviors like power-seeking.^[21] Alignment research has connections to interpretability research,^[22]^[23] (adversarial) robustness,^[20] anomaly detection, calibrated uncertainty,^[22] formal verification,^[24] preference learning,^[25]^[26]^[27] safety-critical engineering,^[28] game theory,^[29] algorithmic fairness,^[20]^[30] and social sciences.^[31]

^ ^a ^b ^c ^d Russell, Stuart J.; Norvig, Peter (2021). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 5, 1003. ISBN 9780134610993. Retrieved September 12, 2022.
^ ^a ^b Ngo, Richard; Chan, Lawrence; Mindermann, Sören (2022). "The Alignment Problem from a Deep Learning Perspective". International Conference on Learning Representations. arXiv:2209.00626.
^ ^a ^b Pan, Alexander; Bhatia, Kush; Steinhardt, Jacob (February 14, 2022). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations. Retrieved July 21, 2022.
^ Zhuang, Simon; Hadfield-Menell, Dylan (2020). "Consequences of Misaligned AI". Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc. pp. 15763–15773. Retrieved March 11, 2023.
^ Carlsmith, Joseph (June 16, 2022). "Is Power-Seeking AI an Existential Risk?". arXiv:2206.13353 [cs.CY].
^ ^a ^b ^c Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Random House. ISBN 9780525558637. OCLC 1113410915.
^ Christian, Brian (2020). The alignment problem: Machine learning and human values. W. W. Norton & Company. ISBN 978-0-393-86833-3. OCLC 1233266753. Archived from the original on February 10, 2023. Retrieved September 12, 2022.
^ Langosco, Lauro Langosco Di; Koch, Jack; Sharkey, Lee D.; Pfau, Jacob; Krueger, David (June 28, 2022). "Goal Misgeneralization in Deep Reinforcement Learning". Proceedings of the 39th International Conference on Machine Learning. International Conference on Machine Learning. PMLR. pp. 12004–12019. Retrieved March 11, 2023.
^ ^a ^b Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma; Brynjolfsson, Erik (July 12, 2022). "On the Opportunities and Risks of Foundation Models". Stanford CRFM. arXiv:2108.07258.
^ Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, J.; Hilton, Jacob; Kelton, Fraser; Miller, Luke E.; Simens, Maddie; Askell, Amanda; Welinder, P.; Christiano, P.; Leike, J.; Lowe, Ryan J. (2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155 [cs.CL].
^ Zaremba, Wojciech; Brockman, Greg; OpenAI (August 10, 2021). "OpenAI Codex". OpenAI. Archived from the original on February 3, 2023. Retrieved July 23, 2022.
^ Kober, Jens; Bagnell, J. Andrew; Peters, Jan (September 1, 2013). "Reinforcement learning in robotics: A survey". The International Journal of Robotics Research. 32 (11): 1238–1274. doi:10.1177/0278364913495721. ISSN 0278-3649. S2CID 1932843. Archived from the original on October 15, 2022. Retrieved September 12, 2022.
^ Knox, W. Bradley; Allievi, Alessandro; Banzhaf, Holger; Schmitt, Felix; Stone, Peter (March 1, 2023). "Reward (Mis)design for autonomous driving". Artificial Intelligence. 316: 103829. arXiv:2104.13906. doi:10.1016/j.artint.2022.103829. ISSN 0004-3702. S2CID 233423198.
^ Stray, Jonathan (2020). "Aligning AI Optimization to Community Well-Being". International Journal of Community Well-Being. 3 (4): 443–463. doi:10.1007/s42413-020-00086-3. ISSN 2524-5295. PMC 7610010. PMID 34723107. S2CID 226254676.
^ Russell, Stuart; Norvig, Peter (2009). Artificial Intelligence: A Modern Approach. Prentice Hall. p. 1003. ISBN 978-0-13-461099-3.
^ Bengio, Yoshua; Hinton, Geoffrey; Yao, Andrew; Song, Dawn; Abbeel, Pieter; Harari, Yuval Noah; Zhang, Ya-Qin; Xue, Lan; Shalev-Shwartz, Shai (November 12, 2023), Managing AI Risks in an Era of Rapid Progress, arXiv:2310.17688
^ "Statement on AI Risk | CAIS". www.safe.ai. Retrieved February 11, 2024.
^ Grace, Katja; Stewart, Harlan; Sandkühler, Julia Fabienne; Thomas, Stephen; Weinstein-Raun, Ben; Brauner, Jan (January 5, 2024), Thousands of AI Authors on the Future of AI, arXiv:2401.02843
^ Smith, Craig S. "Geoff Hinton, AI's Most Famous Researcher, Warns Of 'Existential Threat'". Forbes. Retrieved May 4, 2023.
^ ^a ^b ^c Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (June 21, 2016). "Concrete Problems in AI Safety". arXiv:1606.06565 [cs.AI].
^ ^a ^b Ortega, Pedro A.; Maini, Vishal; DeepMind safety team (September 27, 2018). "Building safe artificial intelligence: specification, robustness, and assurance". DeepMind Safety Research – Medium. Archived from the original on February 10, 2023. Retrieved July 18, 2022.
^ ^a ^b Rorvig, Mordechai (April 14, 2022). "Researchers Gain New Understanding From Simple AI". Quanta Magazine. Archived from the original on February 10, 2023. Retrieved July 18, 2022.
^
Doshi-Velez, Finale; Kim, Been (March 2, 2017). "Towards A Rigorous Science of Interpretable Machine Learning". arXiv:1702.08608 [stat.ML].
- Wiblin, Robert (August 4, 2021). "Chris Olah on what the hell is going on inside neural networks" (Podcast). 80,000 hours. No. 107. Retrieved July 23, 2022.
^ Russell, Stuart; Dewey, Daniel; Tegmark, Max (December 31, 2015). "Research Priorities for Robust and Beneficial Artificial Intelligence". AI Magazine. 36 (4): 105–114. arXiv:1602.03506. doi:10.1609/aimag.v36i4.2577. hdl:1721.1/108478. ISSN 2371-9621. S2CID 8174496. Archived from the original on February 2, 2023. Retrieved September 12, 2022.
^ Wirth, Christian; Akrour, Riad; Neumann, Gerhard; Fürnkranz, Johannes (2017). "A survey of preference-based reinforcement learning methods". Journal of Machine Learning Research. 18 (136): 1–46.
^ Christiano, Paul F.; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep reinforcement learning from human preferences". Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17. Red Hook, NY, USA: Curran Associates Inc. pp. 4302–4310. ISBN 978-1-5108-6096-4.
^ Heaven, Will Douglas (January 27, 2022). "The new version of GPT-3 is much better behaved (and should be less toxic)". MIT Technology Review. Archived from the original on February 10, 2023. Retrieved July 18, 2022.
^ Mohseni, Sina; Wang, Haotao; Yu, Zhiding; Xiao, Chaowei; Wang, Zhangyang; Yadawa, Jay (March 7, 2022). "Taxonomy of Machine Learning Safety: A Survey and Primer". arXiv:2106.04823 [cs.LG].
^
Clifton, Jesse (2020). "Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda". Center on Long-Term Risk. Archived from the original on January 1, 2023. Retrieved July 18, 2022.
- Dafoe, Allan; Bachrach, Yoram; Hadfield, Gillian; Horvitz, Eric; Larson, Kate; Graepel, Thore (May 6, 2021). "Cooperative AI: machines must learn to find common ground". Nature. 593 (7857): 33–36. Bibcode:2021Natur.593...33D. doi:10.1038/d41586-021-01170-0. ISSN 0028-0836. PMID 33947992. S2CID 233740521. Archived from the original on December 18, 2022. Retrieved September 12, 2022.
^ Prunkl, Carina; Whittlestone, Jess (February 7, 2020). "Beyond Near- and Long-Term". Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. New York NY USA: ACM. pp. 138–143. doi:10.1145/3375627.3375803. ISBN 978-1-4503-7110-0. S2CID 210164673. Archived from the original on October 16, 2022. Retrieved September 12, 2022.
^ Irving, Geoffrey; Askell, Amanda (February 19, 2019). "AI Safety Needs Social Scientists". Distill. 4 (2): 10.23915/distill.00014. doi:10.23915/distill.00014. ISSN 2476-0757. S2CID 159180422. Archived from the original on February 10, 2023. Retrieved September 12, 2022.

[aima4-1] Russell, Stuart J.; Norvig, Peter (2021). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 5, 1003. ISBN 9780134610993. Retrieved September 12, 2022.

[dlp2023-2] Ngo, Richard; Chan, Lawrence; Mindermann, Sören (2022). "The Alignment Problem from a Deep Learning Perspective". International Conference on Learning Representations. arXiv:2209.00626.

[mmmm2022-3] Pan, Alexander; Bhatia, Kush; Steinhardt, Jacob (February 14, 2022). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations. Retrieved July 21, 2022.

[4] Zhuang, Simon; Hadfield-Menell, Dylan (2020). "Consequences of Misaligned AI". Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc. pp. 15763–15773. Retrieved March 11, 2023.

[Carlsmith2022-5] Carlsmith, Joseph (June 16, 2022). "Is Power-Seeking AI an Existential Risk?". arXiv:2206.13353 [cs.CY].

[:2102-6] Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Random House. ISBN 9780525558637. OCLC 1113410915.

[Christian2020-7] Christian, Brian (2020). The alignment problem: Machine learning and human values. W. W. Norton & Company. ISBN 978-0-393-86833-3. OCLC 1233266753. Archived from the original on February 10, 2023. Retrieved September 12, 2022.

[gmdrl-8] Langosco, Lauro Langosco Di; Koch, Jack; Sharkey, Lee D.; Pfau, Jacob; Krueger, David (June 28, 2022). "Goal Misgeneralization in Deep Reinforcement Learning". Proceedings of the 39th International Conference on Machine Learning. International Conference on Machine Learning. PMLR. pp. 12004–12019. Retrieved March 11, 2023.

[Opportunities_Risks-9] Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma; Brynjolfsson, Erik (July 12, 2022). "On the Opportunities and Risks of Foundation Models". Stanford CRFM. arXiv:2108.07258.

[feedback2022-10] Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, J.; Hilton, Jacob; Kelton, Fraser; Miller, Luke E.; Simens, Maddie; Askell, Amanda; Welinder, P.; Christiano, P.; Leike, J.; Lowe, Ryan J. (2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155 [cs.CL].

[OpenAICodex-11] Zaremba, Wojciech; Brockman, Greg; OpenAI (August 10, 2021). "OpenAI Codex". OpenAI. Archived from the original on February 3, 2023. Retrieved July 23, 2022.

[12] Kober, Jens; Bagnell, J. Andrew; Peters, Jan (September 1, 2013). "Reinforcement learning in robotics: A survey". The International Journal of Robotics Research. 32 (11): 1238–1274. doi:10.1177/0278364913495721. ISSN 0278-3649. S2CID 1932843. Archived from the original on October 15, 2022. Retrieved September 12, 2022.

[13] Knox, W. Bradley; Allievi, Alessandro; Banzhaf, Holger; Schmitt, Felix; Stone, Peter (March 1, 2023). "Reward (Mis)design for autonomous driving". Artificial Intelligence. 316: 103829. arXiv:2104.13906. doi:10.1016/j.artint.2022.103829. ISSN 0004-3702. S2CID 233423198.

[14] Stray, Jonathan (2020). "Aligning AI Optimization to Community Well-Being". International Journal of Community Well-Being. 3 (4): 443–463. doi:10.1007/s42413-020-00086-3. ISSN 2524-5295. PMC 7610010. PMID 34723107. S2CID 226254676.

[AIMA-15] Russell, Stuart; Norvig, Peter (2009). Artificial Intelligence: A Modern Approach. Prentice Hall. p. 1003. ISBN 978-0-13-461099-3.

[16] Bengio, Yoshua; Hinton, Geoffrey; Yao, Andrew; Song, Dawn; Abbeel, Pieter; Harari, Yuval Noah; Zhang, Ya-Qin; Xue, Lan; Shalev-Shwartz, Shai (November 12, 2023), Managing AI Risks in an Era of Rapid Progress, arXiv:2310.17688

[17] "Statement on AI Risk | CAIS". www.safe.ai. Retrieved February 11, 2024.

[18] Grace, Katja; Stewart, Harlan; Sandkühler, Julia Fabienne; Thomas, Stephen; Weinstein-Raun, Ben; Brauner, Jan (January 5, 2024), Thousands of AI Authors on the Future of AI, arXiv:2401.02843

[:2-19] Smith, Craig S. "Geoff Hinton, AI's Most Famous Researcher, Warns Of 'Existential Threat'". Forbes. Retrieved May 4, 2023.

[concrete2016-20] Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (June 21, 2016). "Concrete Problems in AI Safety". arXiv:1606.06565 [cs.AI].

[building2018-21] Ortega, Pedro A.; Maini, Vishal; DeepMind safety team (September 27, 2018). "Building safe artificial intelligence: specification, robustness, and assurance". DeepMind Safety Research – Medium. Archived from the original on February 10, 2023. Retrieved July 18, 2022.

[:333-22] Rorvig, Mordechai (April 14, 2022). "Researchers Gain New Understanding From Simple AI". Quanta Magazine. Archived from the original on February 10, 2023. Retrieved July 18, 2022.

[23] Doshi-Velez, Finale; Kim, Been (March 2, 2017). "Towards A Rigorous Science of Interpretable Machine Learning". arXiv:1702.08608 [stat.ML].
Wiblin, Robert (August 4, 2021). "Chris Olah on what the hell is going on inside neural networks" (Podcast). 80,000 hours. No. 107. Retrieved July 23, 2022.

[24] Wiblin, Robert (August 4, 2021). "Chris Olah on what the hell is going on inside neural networks" (Podcast). 80,000 hours. No. 107. Retrieved July 23, 2022.

[24] Russell, Stuart; Dewey, Daniel; Tegmark, Max (December 31, 2015). "Research Priorities for Robust and Beneficial Artificial Intelligence". AI Magazine. 36 (4): 105–114. arXiv:1602.03506. doi:10.1609/aimag.v36i4.2577. hdl:1721.1/108478. ISSN 2371-9621. S2CID 8174496. Archived from the original on February 2, 2023. Retrieved September 12, 2022.

[prefsurvey2017-25] Wirth, Christian; Akrour, Riad; Neumann, Gerhard; Fürnkranz, Johannes (2017). "A survey of preference-based reinforcement learning methods". Journal of Machine Learning Research. 18 (136): 1–46.

[drlfhp-26] Christiano, Paul F.; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep reinforcement learning from human preferences". Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17. Red Hook, NY, USA: Curran Associates Inc. pp. 4302–4310. ISBN 978-1-5108-6096-4.

[LessToxic-27] Heaven, Will Douglas (January 27, 2022). "The new version of GPT-3 is much better behaved (and should be less toxic)". MIT Technology Review. Archived from the original on February 10, 2023. Retrieved July 18, 2022.

[28] Mohseni, Sina; Wang, Haotao; Yu, Zhiding; Xiao, Chaowei; Wang, Zhangyang; Yadawa, Jay (March 7, 2022). "Taxonomy of Machine Learning Safety: A Survey and Primer". arXiv:2106.04823 [cs.LG].

[29] Clifton, Jesse (2020). "Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda". Center on Long-Term Risk. Archived from the original on January 1, 2023. Retrieved July 18, 2022.
Dafoe, Allan; Bachrach, Yoram; Hadfield, Gillian; Horvitz, Eric; Larson, Kate; Graepel, Thore (May 6, 2021). "Cooperative AI: machines must learn to find common ground". Nature. 593 (7857): 33–36. Bibcode:2021Natur.593...33D. doi:10.1038/d41586-021-01170-0. ISSN 0028-0836. PMID 33947992. S2CID 233740521. Archived from the original on December 18, 2022. Retrieved September 12, 2022.

[31] Dafoe, Allan; Bachrach, Yoram; Hadfield, Gillian; Horvitz, Eric; Larson, Kate; Graepel, Thore (May 6, 2021). "Cooperative AI: machines must learn to find common ground". Nature. 593 (7857): 33–36. Bibcode:2021Natur.593...33D. doi:10.1038/d41586-021-01170-0. ISSN 0028-0836. PMID 33947992. S2CID 233740521. Archived from the original on December 18, 2022. Retrieved September 12, 2022.

[30] Prunkl, Carina; Whittlestone, Jess (February 7, 2020). "Beyond Near- and Long-Term". Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. New York NY USA: ACM. pp. 138–143. doi:10.1145/3375627.3375803. ISBN 978-1-4503-7110-0. S2CID 210164673. Archived from the original on October 16, 2022. Retrieved September 12, 2022.

[:4-31] Irving, Geoffrey; Askell, Amanda (February 19, 2019). "AI Safety Needs Social Scientists". Distill. 4 (2): 10.23915/distill.00014. doi:10.23915/distill.00014. ISSN 2476-0757. S2CID 159180422. Archived from the original on February 10, 2023. Retrieved September 12, 2022.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]

[27]

[28]

[29]

[30]

[31]